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Analysis and Pattern Recognition in Large, Multidimensional 
Data Sets Using Low-Resolution Data Grouping 

A portion of the disclosure of this patent document contains material 
which is subject to copyright protection. The copyright owner has no objection to 
the facsimile reproduction by anyone of the patent document or the patent 
disclosure, as it appears in the Patent and Trademark Office patent file or records, 
but otherwise reserves all copyright rights whatsoever, 

1. Field of the Invention 

This invention relates to analysis and pattern recognition of data. More 
particularly, this invention relates to methods, systems and devices and 
combinations thereof for analysis and pattern recognition in large sets of 
multidimensional data using low-resolution data grouping. 

2. Background 

With the advent of computerization and the low cost of data storage and 
acquisition, people in many endeavors are now accumulating very large sets of 
data. For example, scientists in dmg and chemical companies now use automation 
to perform so-called high-throughput screening ("HTS") of chemical compounds. 
UTS uses automated, relatively low-cost techniques to obtain various items of 
information about chemical compoimds. The goal of using HTS is to obtain 
information about a very large nimiber of compounds in a quick and relatively 
low-cost manner. Having accumulated a very large HTS data set, it is necessary 
to evaluate the data in order to detemiine which, if any, of the analyzed 

1 
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compounds warrants further investigation. However, the results of such HTS tend 
to be very large sets of multidimensional data, on tiie order of thousands of rows 
and do2:ens of columns, and so it is very difficult to make decisions just by 
looking at the data. 

In addition to the very large amounts of data produced by HTS, difiBculties 
in existing data handling and analysis methods include the following: 

• Data comes from very diverse sources, including HTS laboratories, 
physical measurements, bio-scientists' laboratories, various computational 
software programs, etc., and the different sources tend to have very diverse 
kinds of output including numbers, text, mixed data types, error notations, 
blank data, replicate data (more than one value per compound), etc. 

• Not all sources produce data on the same list of compounds, or in the same 
order. 

• Some data values are misleadingly too precise, i.e., have high relative 
experimental errors or noise, and can easily be over-interpreted. 

• Medicinal chemists have to weigh very different kinds of factors (for 
example, molecular weight vs. dose-responsiveness vs. ClogP vs. 
secondary biology vs. selectivity across screens) in trying to determine 
which are the best compoimds or clusters of compounds to which to 
devote further work. 
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Summary of the Invention 

This invention solves the above and other problems by providing 
automated tools to help vnih and speed up these data handling and analysis 
processes. These tools embody some assumptions about how the data should be 
treated by internalizing the most generally acceptable assumptions, but leaving 
more idiosyncratic decisions to individual users. 

A central concept on which this invention is based is grouping data into a 
relatively small number of categories using low-resolution data grouping. The 
grouping is visualized by assigning colors to data groups, e.g., in spreadsheets. 
Grouping of data potentially changes the precision of the data. 

This categorization of data has several major benefits, including: 

• creating a visual means of finding data patterns; 

• beneficially blurring small variations in numerical data that are, in 
practice, excessively fine distinctions, possibly due to experimental 
error; and 

• providing, in the colors themselves, a means or "common currency" to 
evaluate candidates across a wide range of data types. 

Accordingly, in one aspect, this invention provides mechanisms to 
expedite pattem recognition in large sets of multidimensional data, such as those 
that chemists assemble when evaluating hits firom high-throughput screening 
(HTS) and deciding which ones will get priority for fiirther investigation. In 
controlled trials, this invention has reduced the time to evaluate real data sets, 
firom days of intense human effort, which is vulnerable to errors due to volume or 
fatigue, to a few minutes of automation with graphical presentation of results. 



wo 01/08039 




PCT/USOO/20401 



It quickly becomes obvious upon using the system that the tools also have 
value in data-handling areas other than HTS. Examples include selection and 
management of any kind of tabulated data, e.g., portfolio management for any 
kind of rated portfolios, selection of drug candidate compounds, selection and 
management of proteins that are candidates for targets for drugs, selection and 
management of research projects competing for resources, and evaluating 
employee performance or job candidates. 

The system of this invention includes a new special command menu, a set 
of graphical user interface worksheets, and action buttons to facilitate the coloring 
and color analysis processes for the user. While the central process is the data 
grouping and coloring, there are also new tools for the upstream, or pre-grouping 
and coloring processes of importing, assembling, regularizing, and characterizing 
data in a spreadsheet, and for the downstream processes of visualizing, scoring, 
comparing, and sorting large amounts of color-coded data. The data-grouping and 
spreadsheet-coloring tool is presently implemented with a flexible, powerful, and 
convenient user interface that does not require knowledge of spreadsheet macros 
or of the Visual Basic language (used for the system's implementation). 

Accordingly, this invention provides methods, systems and devices for 
operating on data. 

In one aspect, the method of this invention provides at least one user- 
defined grouping rule for grouping the data into a user-definable number of 
groups. At least one of the grouping rules is applied to the data. The data may be 
provided in a table and the grouping mle applies to at least one user-selectable 
colimin of the table. In some embodiments, the grouping rule defines breakpoints 
corresponding to the user-definable number of groups. Application of the rule the 
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data divides fhe data into groups based on the breakpoints. The method may 
include presenting the grouped data in a manner that visually distinguishes the 
groups. In some embodiments, the grouping rules associate colors with groups 
and the grouped data is presented with an aspect of the data colored according to 
the rules. 

Sometimes the data are in labeled coliunns in a spreadsheet, and the 
grouping rule specifies at least one breakpoint and a corresponding color for each 
range defined by the breakpoint. The grouped data are presented by coloring each 
data item in one labeled column of the data based on the breakpoint and the 
corresponding color of the breakpoint. 

The breakpoints may be numeric or textual values. In some embodiments, 
the breakpoint is detemuned automatically based on the data. 

Sometimes the data are provided in a table, and backgrounds of table cells 
are colored according to the rules. 

The number of groups may be fewer than a number of possible data values. 

In another aspect, this invention is a method of operating on data by 
providing at least one user-defined grouping rule for grouping the data into a user- 
definable number of groups. The grouping mle is applied to the data to generate 
grouped data. At least one user-defined scoring rule is used to score grouped data 
according to user-defined scores. The scoring rule is applied to the grouped data 
to score the grouped data. 

In yet another aspect, this invention is a method of operating on data, in 
which data are grouped by applying to the data at least one user-defined grouping 
rule for grouping the data into a user-definable number of groups. The grouped 
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data are scored by applying to the grouped data at least one user-defined scoring 
rule for scoring the grouped data according to user-defined scores. 

In some embodiments the data can be a number of parameters for each of a 
number of cases and the scoring rule comprises a scoring function of user- 
selectable parameters and user-defined weights for the selected parameters to be 
used in scoring the cases. The scoring applies the fimction to the data to obtain a 
score for each case. Sometimes the method includes sorting the scored cases by 
score, individually or by cluster, as described below. 

The notion of clustering is that subsets of the various cases may be 
associated into clusters by having identical entries in any user-selected column of 
data, known as a clustering column. In some embodiments of the invention, the 
integrated clusters are treated by averaging the properties of all the cases which 
comprise each cluster. 

Thus, according to aspects of this invention, in order to facilitate analysis 
and pattem recognition in large, multidimensional data sets, the precision of the 
data is potentially changed (implemented, e.g., by grouping the data) and then the 
data are presented for visualization (implemented, e.g., by coloring the data). 

Brief Description of the Drawings 

This file contains at least one drawing executed in color. Copies of this 
patent with color drawings will be provided by the United States Patent and 
Trademark Office upon request and payment of the necessary fee. 

The above and other objects and advantages of the invention will be 
apparent upon consideration of the following detailed description, taken in 
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conjunction with the accompanying drawings, in which the reference characters 
refer to like parts throughout and in which: 

Figure 1 shows a typical computer system on which the present invention 
operates; 

Figure 2 shows an overview of the functionality of the present invention; 
Figures 3A-3B depict a display of data in a spreadsheet; 
Figures 4A-4B show a color control rules worksheet according to one 
embodiment of the present invention; 

Figures 5A-SB show data coloring rules; 

Figures 6A-6C show a data coloring control panel and a flow chart of the 
data coloring process, respectively; 

Figures 7A-8B show the worksheet of Figure 3A and 3B after various 
coloring rules in Figure 4A have been applied; 

Figures 9Ay 9B, lOA, and lOB depict displays of data in spreadsheets; 

Figures llA and IIB show the form of the cluster control worksheet 
according to one embodiment of the present invention; 

Figures IIC-IID shows control panels from the cluster control 
worksheet of Figures IIA-IIB; 

Figure 12 shows the enlarging of the cluster starts mechanism according 
to one embodiment of the present invention; 

Figures 13A-13D show the application of vertical display re-scaling 
according to one embodiment of the present invention; 

Figures 14A-14D and 15A-15B show the application of the scoring and 
sorting of clusters according to_one embodiment of the present invention; 
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Figures 16A-16N, 16P and 16Q show aspects of the application of the 
dose-response scoring and estimation of potencies according to one embodiment 
of the present invention; 

Figures 17A-17B show the application of the sheet statistics tool 
according to one embodiment of the present invention; 

Figures 18A-18D show the application of the scoring and sorting of 
clusters for the purpose of project prioritization and management according to one 
embodiment of the present invention; 

Figures 19-24 show examples of the application of this invention to 
various types of data; and 

Figures 25 and 26 show application of an aspect of this invention. 

Detailed Description of the Presently Preferred Exemplary 

Embodiments 

Overview 

Figure 1 shows a typical computer system 100 on which the present 
invention operates. The computer system 100 includes a processor (CPU) 102 
connected to a memory system 104 and a display 106. The computer system also 
includes various input devices including a keyboard 108 and a mouse 110 or other 
pointing device. Internal storage 112 (e.g., a hard disk, a CD ROM and the like) 
and external storage 114 (such as a floppy disk, CD ROM and the like) are also 
provided. 

Various aspects of this invention are implemented as computer software 
programs or algorithms 116 which run on the computer system 100. The software 
programs 116 can reside in the internal storage 112, the external storage 114, 
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and/or in the memory 104. The software programs 116 operate on data 118 which 
is provided, e.g., on the external storage 114. The software programs 116 operate 
in a standard and known manner by being executed on the processor 102 of the 
computer system 100. 

In some embodiments of the present invention, the user can create and 
modify various executable rules 120 which can operate on the data 118. For the 
sake only of explanation, the rules 120 are depicted separately from the data in the 
figures. However, as explained in more detail below, some or all of the rules 120 
can be part of the data 118. 

In preferred embodiments, the computer system 100 is capable of running 
the spreadsheet program Excel™ 95 (hereinafter "Excel'O from Microsoft 
Corporation, and the software computer programs 116 are written in Microsoft 
Corporation's Visual Basic (hereinafter "VB") and are provided as an add-in to 
Excel. A single copy of software thus serves all data files on a particular 
machine. To conserve Excel resources, in some embodiments, the package self- 
installs the add-in when the user opens a data file, and un-installs the add-in when 
the last data file in memory is closed. 

In a preferred embodiment, this invention works entirely within the 
environment of EXCEL. Excel structures data files as workbook files which 
contain worksheets. The programs 116 of this invention consist of special EXCEL 
worksheets, called control sheets, on which input data is written by the user into 
designated labeled cells. The control sheets are part of the same Excel workbook 
file as the data. The control sheets also contain action buttons to execute the 
various procedures associated with this invention. The rules 120 are formed by 
setting various parameters in the control sheets. 
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When the workbook file is saved, the parameters (for the rules 120) are 
stored on the control sheets along with the data, and they can be modified and/or 
re-executed at any time without having to re-enter anything. The results of 
operations are automatically written as worksheets in the same workbook file, 
providing a convenient, integrated data environment in a single file. 

The system according to the present invention operates, in one aspect, in 
accordance with FIGURE 2. Recall that the user's aim is to perform analysis and 
pattern recognition in large, multidimensional data sets using (potentially low 
resolution) data grouping. To this end, the user and/or the system will create rules 
for coloring and presenting the data First (at 122) a user creates and organizes the 
data 118. Various tools (discussed below) are provided to aid in the creation and 
organization of the data. Then (at 124) the user creates rules 120 for operating on 
the data 118. The rules 120 can be created before or after the data 118, rules can 
be reused for different sets of data and multiple rules can apply to the same data. 
The creation and operation of rules are discussed in greater detail below. Once the 
data 118 and the rules 120 are created, the user then selects some (or all) of the 
rules to apply to the data (at 126). Specifically, the user groups and thereby colors 
the data according to selected rules. With the data grouped and colored according 
to the rules, the user can then perform group/color-mediated data mining (at 128). 

Figures 3A-3B show views of the program of this invention in operation 

v\dth a sample Excel sheet 300, denoted "DEMO 1" (302) containing data (not all 

the data in the sheet is visible). The views of Excel worksheets shown in the 

various figures and examples that follow are the views that are presented on the 

display 106 of the computer system 100. Sheets in an Excel workbook are 

labeled with tabs at the bottom of the worksheet. The data on the "DEMOl" sheet 

10 
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300 consists of eight columns of data for each of a number of compounds. The 
compounds are denoted "Cmpdxx", where "xx" ranges from "01" to the number of 
compoimds. In Figure 3B, the last compound visible on the data sheet is 
"Cmpd58". The eight columns are headed: 

1. "Cmpd" (column A); 

2. "Series" (column B); 

3. "Testl" (column C); 

4. "Test2" (column D); 

5. "Test3" (colunm E); 

6. "HTS SPA Dose-Resp % Inhib @3xlO-6M" (colunm F); 

7. "HTS SPA Dose-Resp % Inhib @ 1x1 (column G); 

8. "HTS SPA Dose-Resp % Inhib @ 3xlO-7M" (column H); and 

9. "HTS SPA Dose-Resp % Inhib @ Ixl0-7M" (colunm I), 

In addition to the "DEMO I" worksheet 300, the Excel workbook shown 
in Figures 3A and 3B has seven other worksheets, denoted "DEMO 2'* 304; 
"DEMO 3" 306; "clusterinfo DEMO'' 308; "Append Control" 310; "Color 
Control" 312 and "Cluster Control" 314. The last three worksheets, denoted 
respectively "Append Control"; "Color Control" and "Cluster Control," contain 
various mles and controls (to be discussed below). The data in worksheets 
denoted "DEMO 1" 302; "DEMO 2" 304; "DEMO 3" 306; and "clusterinfo 
DEMO" 308 correspond to data 118 (Figure 1) and the controls or rules in the 
worksheets denoted "Append Control" 310; "Color Control" 312, and "Cluster 
Control" correspond to the rules 120 (Figure 1). 

Figures 4A-4B show a color control rules worksheet (312, denoted "color 

control") according to the present invention, as displayed on display 106 of the 
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computer system 100. TTe color co„«>l workshe« 31J is sho«n «i* some rules 
to puce. i.e.. havtog values se, and ote rules left blank. These rules are 
sho«n as e>^ples only. and. as wift any of fte ofter w=s of rules, any or all of 
fl« mles can be se. by user. A wical data coloring rule 130 is sho«n to 
. F.G«RE 5A. The rule 130 tas aheady been set up and operates on the appropriate 
data when selected byauser(usmgmousellO.F,<=.BElor some other pomting 

device)tothe area 132 marked -CUclcheretorunthese-.Tl^ruleUOCaswithaU 
of a.e prefe^d color control rules) has four parts, namely the name of the sheet 
,34 contatotag the data on which the rule is to operate CDEMO 1" to the example 
„ of F.G«KE 5A); the columns 136 of dataof the sheet on whichth. rule is to 

operate ("E" m the example of FlGUBE 5A); the number of colors 138 to be used 
by the rule 130. a number of br«kpotots 14. (d«K«ed "break 1" to "break 4" m 
^ example of F.GURE 5A); and a crrespondtog number of colors 142 for each 
,angedeflnedbytheb,eakpoto.s(deno,ed"colorl"tocolor4»md«exampleof 

,s F.GURE5A). AcU^l,.asexplatoedbelow.then„mberofbreakpomtsiso„eless 
ton «,enumberof colors.. In the spedfic example shown mFlGUtu^SA. the mle 
has three breakpotots of l.Sand 10. deflntog four ranges with iburcorrespondrng 

colors 142. namely light gteen. yellow, otange and red. Preferably the named 
colors are also depicted to the actual colors, so that, to this example, the 
« background of the word "Ughtgreen" is shown m light green, the background of 
the word "yelloW" is shown to yellow and so on. 

In flns tovention it is p«ferable to show data and metadata (headings etc.) 
to color, to some embodta«nts. the coloring is taiplemented by showtog a 
backgbund area of the text representing the datatothe.appmpriate color, 

Somette.es the actual text representmg the data is shown m the appropriate color. 
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15 



20 



U.p^U,p«fe„ed«nbodi.e„.s>.fon.co.oHso..,changedi.c^«he« 

e^boaimcnt Combi.«ions.fbc*.^hescan^use.. For example. *e 

^.^^ section of wo«. •^Uo«" is preferably sho«n in *e color yellow. 
,Hs aiso possible «, shov. *e word itself. i.e., *e fon.. in U>e color yenow. as iong 

as Aat color is distinguishable fiom the background. 

Theparticular rule 130 shown inf1o™ES4AandSA. operates asfoUows. 

when selected: 

tosheet-DEMO 302. in colunm E, values less than or equal «. 1 (break 
oar. colored UghtgreenMor 1); values in the range 1 to 5 (between break 1 
andbreakijarecoloredyellowtcolor 2); values in,herange5«>.0(br.ak2.o 

, ^ ^ rnnior 3V and values greater than 10 (break 3) are 
break 3) are colored orange (color 3), ana vaiucs 

colored red (color 4). 

Another typical data coloring rule 130-1 ftom the color con«..l sheet 312 
is Shown in no««. 5B. The rule 13<^1 is set up to operate on colunu« "C" and 
..ff. of sheet "DEMO 1". The rule 130-1 uses three (3) breakpoints (breakl=0.1. 
Wc2=.andbre.k3=5)deflningfour ranges withfourC4)cor,esponding colors 

("lightgreen", "yellow", "orange", and "red"). 

•n^resultsofapplying the rule 130-1 of F.G«»5Btothedatainsheet 
"DEMO r (302.F,C.;RE3)areshowninF.C»«.=s7A-7B. As can be seen from 
7A-7B. after application of the rule 130-1. aU ofthe data in colunuts C 
3«, D of the sheet labeled "DEMO 1" has been colored according «> the rule. 

Sp.cificah,.dataha™gava«lesSthan-or«p-»b.-^^C»'>'«-^ 
coloredU^gr^^datavaluesintherangebetweenbreak 1 and break 2 (0.1 to 
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1) have been colored yellow; data values in the range between break 2 and break 3 
(1 to 5) have been colored orange; and data values greater than break 3 (5) have 
been colored red. 

The results of applying all of the other color control rules shown in 
Figures 4A-4B to the data in sheet "DEMO 1" are shown in Figures 8A-8B. 
The rules can be applied individually (as shown above with respect to 
Figures 7A-7B), or they can be all be applied at the same time. In order to apply 
all rules to a particular data set (sheet), each rule can be individually selected or 
the area labeled "Re-run ALL rules for sheet named DEMO 1" (on the right 
side of Figure 4A) can be selected. Note that if two rules apply to the same 
colmnn of the same sheet, the second rule run on that column will override the 
first rule run on that column. 

To create a coloring rule a user performs the following (with reference to 
Figure 6B): 

(1) Select the "COLOR Control" sheet 312 and pick a control panel on 
that sheet to use (an empty panel or one containing a rule no longer 
needed) (at 600). All control panels on a sheet can be cleared by 
clicking the button labeled "Clear all entries on this sheet" 
(318 in Figure 4A). 

(2) In the selected control panel, enter the name of the sheet to be 
colored (at 602). 

(3) In the selected control panel, enter a column or columns (at 604). 

For multiple columns, either list them separated by commas, or use 

a colon-or hyphen to denote ranges, or some combination. For 

example, "A:D,F" means columns A,B,C,D, and F. To aid in 

14 
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ehoosin.co.»nu>s.*e-rcanrigh..cU*on*ec.U confining 
^^.of*e«a*e«.a«ipick"Op»T«inScr^n"fromfl.= 
shornu, m»» *«. app-s. a special dual display. This 

also creaBs a "Close Twin Soreen" button «. go back. 
(4)Chooscan«n.berofcolors,ouse(a.*««).ei*erb,en.eringto 

„u„*„ofco,orso,by«p.a,ealyc.iokingO,egra,bu„onadiac«« 

fte cell labeled of colors". In prefeted embodiment, fte 
3,,^ dlo>.s for five breakpoint and six colors per rule. 
Accordingly. U« numbers wiU cycle ftonr 1 .o 6. and various cells 

below them wUl be blacked out accordingly. 
(5)a«er*ebreakpoin.*a.define*ecolorgroups(a.««).inany 

of three modes: 

the breakpomts, i.e.. the boundaries between the 
color groups, one less than the number of colors, in 
increasing numerical order. Cells whose values 
exactly equal a breakpoint value wUl be colored 

with the lower group (breakpoint 1 is colored with 

color l,etc.) 

20 ^ 

"log", or "count" as the first breakpoint. If multiple 
columns have been chosen, the user must also enter 

•♦^ «TJ crale alP" at the bottom 
"yes" or "no" opposite Re-scaieaii. *vt" 

oTthe panel, to-indicate whether eachcolynm 
should get its own auto-breakpoints, or whether the 

25 
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5 



auto-breakpoints of the first column (first in list in 
the rule, not first on the data sheet) should be used 
foraU. 

This mode reports information about the 
breakpoints it determines, and thus could also be 
used to explore the distribution of numerical values 
in a column prior to a final manual breakpoint 
selection. 

c) Text data : enter the strings to be matched and 

colored, in preferred embodiments, up to five (5) in 
number. Matching is case-insensitive unless the 
string is enclosed in double quotes (" and 
otherwise, no quotation marks are necessary. 

Several special text strings act as operators if entered as 

the first word in a rule cell: 



Rule Entry 

(OPERATORS need not 
be uppercase — ^here only 
for emphasis) 


Meaning 


test string 


color data cell if its whole content matches the test strmg 


NOT test string 


• color data cell if its whole content does not match the 
test string; 

• will not color numeric cells 


CONTAINS test string 


color data cell if contains the test string as a substrmg 
anywhere 


NOTCONTADMS test 
string 


• color data cell if it does not contain the test string 
anywhere; 

• will not color numeric cells , 
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Rule Entrv 




(OPERATORS need not 
be uppercase — ^here only 
for emphasis) 


Meanins 


BEGINS test string 


color data cell if it begins with the test string 


ENDS test string 


color data cell if it ends with the test string 


* (an asterisk) 


(wildcard) color data cell containing any data, including 
numeric cells 


BLANK 


color data cells that are blank 



Using quotes to force matching to be cEise-sensitive also 

works with strings that follow an operator. 

It is possible to construct a text-coloring rule in 
which certain cells may satisfy more than one of the 
"breakpoint" values. For example, if a rule says that 
"active" is colored green and "contains act" is 
colored red, then the word "active" in a cell would 
satisfy both. In such cases, the system colors the 
cell according to iho first condition satisfied on the 
list of breakpoints. This dependence on the 

order can be used advantageously to achieve 
complex coloring conditions. The sequence of 
conditions can be considered as a series of filters, 
through which only the sis-yet-uncolored cells fall 
through to the next decision. 
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(6) Enter the names of the colors to use (at 610), in the order 
corresponding to the breakpoints. A display of color samples is 
provided at the right side of the Color Control sheet 312. A user 
need only enter the name, and the appropriate cell will become 
colored when the tool is executed. If the user wants the color to 
display immediately, he can copy and paste the sample cell into the 
rule's color cell. A special pseudo-color named "SKIP'' is used to 
tell the system not to color the cells whose data falls in this group. 

(7) When the rule has been created, the user executes the rule by 
selecting the rule's "Click here to run these" button on the 
panel filled in (at 612, Figure 6C). 

(8) To create different coloring rules for other columns, repeat the 
above in additional control panels. If the user runs out of control 
panels, he can create more control panels by copying an existing 
one and pasting it onto a blank section of the color control sheet. 

To the extent that a single panel cannot hold all the requirements for a 
particular rule, a user can combine two or more panels to create a particular 
rule. For example, if a user needs ten (10) breakpoints, two panels can be 
used. 

With reference to the coloring rule is shown in Figure 6A, once the 
rule has been set, a number of parameters are stored in the system. The 
parameters are "sheet name" ("DEMO 3" in Figure 6A), column 
specification, number of colors, array of breakpoints, array of colors, and 
- multicolumn scaling mode. 
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The data coloring mechanism operates as follows, with reference to the 

flowchart of Figure 6C.: 

1 The user enters the parameters into a rule panel on a "Color 
CONTROL" worksheet 312. e.g.. as described above with reference to the panel of 
Figure 6A. 

2. The user selects (clicks) the activation button (labeled "Click here 
to run these") on that panel (at 612). This causes the system to: 

(A) Read and interpret the parameters from the panel (at 614). 
The system can identify which button was cUcked using the Visual Basic ("VB") 
"caller" property, described in more detail below. The parameters are then read 
based on the identity of the cell location of the button using the VB "TopLeftCell" 
property. The system retrieves the parameters (sheet name, column specification, 
number of colors, array of breakpoints, array of colors, and multicolumn scaling 
mode) from cells in this panel by relative reference to the button cell. 

(B) Next, determine the mode ofthe coloring rule (at 616) (i.e.. 

numeric v. text or manual v. automatic, and, if automatic, which of value, log or 
count). This uses the analysis ofthe first breakpoint entry. 

(C) CompUe a list ofthe columns specified in the "column 
specification" parameter (at 618). This is done by scanning the various areas 

20 contained in the selection, as follows: 



10 



For Each singlearea In Select ion. Areas 
For Each c In singlearea. Colunins 
If Not CountEn5)ty Then 

Ir = LastRowInColumnCc. Column) 

End If 
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If Not CoxintEmpty And Ir = 0 Then 
•skip this empty column 
ncols = ncols - 1 



Else 



add this column to the list 



icol = icol + 1 



colnumarray ( icol ) = c . Column 



End If 



Next c 



Next singlearea 



(D) If an auto-breakpoint mode is being used (determined at 



620), analyze the data values to determine the breakpoints (at 622). This is done 



(i) Collecting statistics on the data distribution in each 
specified column; and 

(ii) Calculating the automatic break points for the 
appropriate mode. For example, the auto-value breakpoints 
are determined as follows: 



If breakmode = "VALUE" Then 

interval = (maxvalue - minvalue) / ncolors 

break (0) = minvalue 

For ibreak = 1 To ncolors - 1 

break <ibreak) = break (ibreak - 1) + interval 
Next ibreak 



(C) Loop through the cells in the chosen columns on the chosen 



by: 



(iii) Displaying the results for user approval or 



cancellation. 



worksheet (at 624). 
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(D) Compare each ceU's value to the list of breakpoints (at 
626). If the coloring rule is in text mode, use the meanings of the special 
breakpoint operators ("contains", "blank", asterisk "*"; or quotation 
marks). 

(E) When a match is found, apply the appropriate color (at 

628). 

The code below illustrates the processes (D) and (E) for numeric 
breakpoints: 



10 
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25 



30 



For Each cell In Range (Cells (StartColoringRow, colnum) , 

Cells (FinishColoringRow, colnum) ) 

cvalue = cell. Value 
colored - False 
If isNumeric (cell. value) Then 
If Not IsEmpty(cell) Then 

' fhave to test Jboth because EMPTY is numeric) 
For ibreak = 1 To ncolors - l 

If cvalue <= break (ibreak) Then 
If color (ibreak) <> "SKIP" 

cell . Interior . Color Index = 

Color (ibreak) 
Call Text Contrast (cell) 

End If 

colored = True 
Exit For 
End If 
Next ibreak 
If colored = False Then 

' Not hit yet? Must be top category, so: 

If color (ncolors) <> "SKIP" 

cell, interior. Color Index = Color (ncolors 

Call TextContrast (cell) 
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End If 



colored 



True 



End If 



End If 



Else 



' not numeric - just don't color it 



End If 
Next cell 

The operation of the data coloring tool of this invention will now be 
described in greater detail. Each coloring rule is provided in a coloring control 
panel that has the general form of a coloring rule as shown in Figure 6A. In one 
preferred embodiment, each coloring control panel 144 is a double-outlined unit, 
sixteen (16) cells high by two (2) cells wide. As noted above, a user is provided 
with coloring control panels on the color control worksheet 312. A user can use 
the coloring control panel 144 to set the sheet and column(s) on which the rule is 
to operate, the number of colors, the various break points and the colors associated 
with those breakpoints. The sheet is set by entering its name into the cell 146 
adjacent the cell labeled "sheet". The column (or columns) on which the rule is to 
operate is (are) set by entering its (their) name in the cell 148 adjacent the cell 
labeled "column(s)". The number of colors is set by the user by selecting the cell 
150 adjacent to the cell labeled of colors". Each time the cell 150 is selected it 
increases the number of colors, up to a maximum of six (6), i.e., rotating through 
the values 1 to 6. Le., when the cell 150 shows a "6" and is selected, it reverts 
back to "r\ That is, selecting the cell 150 causes the value in the cell to cycle 
from "1" to "6" and then back to "1". 

A Visual Basic ("VB") macro function CCallColorColumn^^) is associated 
with the top cell 152 of the control panel 144. When the cell 152 is selected by 
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the user (with the mouse 110 or the like), the function associated with that cell is 
executed by the computer (CPU 102). In the presently preferred embodiments, the 
CallColorColumn function extracts the button name of the cell 152 and then calls 
a second function CCallColorColumnSubroutine") with that button name as one 

5 of the parameters. The fimction CallColorColumnSubroutim takes three 

parameters, namely ButtonName, StartColoringRow, and FinishColoringRow. 
The two parameters StartColoringRow, and FinishColoringRow are optional. 

First, the function CallColorColumnSuhroutim determines what specific 
values to use for the coloring by reading them fix>m the control panel 144. Since 

10 the values are all in fixed positions relative to the selected button cell 152 that 

initiated the call to the function CallColorColumn, the values can be determined 
once the location of that button cell 152 has been determined. This is done using 

the following Visual Basic code: 

Sheets { " Color Control " ) . Act ivate 

15 headingrow - 

ActiveSheet . Buttons (ButtonName) . TopLef tCell . Row 

headingcol = 

ActiveSheet .Buttons (ButtonName) . TopLef tCell . Column 

20 

Note that if the fimction CallColorColumnSubroutim was called from 
another sheet (not "Color Control") then this method will not find it. 
The various parameter values are then read as follows: 
Sheet name: 

25 datacol = headingcol + i 

sheetname = Trim (Cells (headingrow + 1. datacol) .Value) 



23 



10 



• PCTAJSOO/20401 

If there is no sheet named "sheetname" an error function is called. 
Generally, in preferred embodiments, a great deal of error checking takes place to 
ensure that the user is provided with a friendly and useable interface to the 
program. Most of the error checking is not mentioned in this description, 
however, one skilled in the art would know what kinds of error checking to 
implement in order to provide a user-friendly working environment. 

The column(s) to be colored are specified by : 

Cells {headingrow + 2, datacol) .Value 
The number of colors is specified by the variable ncolors, where: 

ncolors = Cells {headingrow + 3, datacol) .Value 
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Within the fimction CallColorColumnSubroutine there are two arrays, 
named break and color, which are used to store the breakpoints and colors, 
respectively. The first breakpoint is set as follows: 

break (1) = Cells (headingrow + 4, datacol) .Value 

The value of the first breakpoint is used to determine the break mode 
CTsaJMERIC, "VALUE", "LOG", or "COUNT*). l£break(l) (as determined 
above) is numeric, then the mode is set to "NUMERIC", otherwise, if break(J) is 
one of "VALUE", "LOG", or "COUNT", then the break mode is set to that mode, 
otherwise the break mode is set to "TEXT". 
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Next, the function determines whether multiple columns were specified, in 
which case it determines whether the user selected to re-scale all the columns. 
The user's re-scale selection is determined by: 

rescale_all_string = Cells (headingrow + 
15, datacol) .Value 

Now the rest of the breakpoints (if any) are read. If the break-mode is 
"AUTO" then the breakpoints are set as follows: 

For i = 2 TO lastbreaknum 
J break(i) = cells (headingrow + 3 + 1, datacol) .Value 

Various possible errors are checked for. E.g., if any breakpoints are 

missing (i.e., if break(D is empty, the user is notified. Also, if the break mode is 

"NUMERIC" and non-numeric breakpoints are set, the user is notified. If 

,5 numeric breakpoints are not in mcreasing order, the user is notified. As noted 

above, generally, in preferred embodiments of the present invention, a great deal 

of error checking is performed on all user mpats to ensure that the values are 

correct and consistent. Most error checking is omitted from this description. 

The CallColorColumnSubroutine maintains an array, colorname, which 

maps integers to colors. In preferred embodiments, there are fifty six (56) colors 

available. To use the higher numbered colors, the computer's video card must be 

set appropriately. Using the colorname array, the program next associates the user 

provided color names with index numbers. Specifically, for each of the (up to six 

in a preferred embodiment) colors specified, the user specifies an actual color 

25 name, denoted cname. This name is determined for each color by: 

For j = 1 To ncolors 

cname = Cells (headingrow + 8 + j , datacol) .Value 
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o , j datacol) . interior. Colorlndex - 
cells (headingrow + 8 + D , « 
Color ( j ) . 

where*, value of *evariab..y».g^ ftoml«.«»/- 
...ece.Hs^e.p™c.sea.,a^o.r»Co-^^.. 

With the parameters read and checked, me y 
....e..ec«a.ee.Cs^«— >• "^"-^"^^ 

one-by-one by the foUowing program code: 

call ParselnpuUXnString. "^^"^^^^ 
por Bach singlearea In Range (xnspecxfie 
" " Por Kach c in singlearea. Columns 

colnuni - c. column ^^n^rs break, 

call ProcessOneColutnnCcolnum. ncolors. 
^^^^ color, breakmode. 

rescale_all. sheetname, 
20 StartColoringRow, 

FinishColoringRow) 

Next c 
Next singlearea 

T.e,^i«.perfo™eab,«.e^«ion..c«^'— 
^eco,u»ae.^Wco,-»»o„^.— 

The d=sfgn«edco.>«mUcolored»on. *e row «-e^»*»8«o 
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variable rescalejiH 

call ;.u.<.Valu.Brealcpoin«(coln«, colletter, 
,0 breafc. Color, breal«.ae, res=.le_all> . 

0», if *e ^rea^o^ ^ sc. «. "COUNT' and *= value o, 

call .utoCountBrea^oi"" <=olnu», coll..«r, ncolors. 
Wi«.*ab,ea^ta«ca.cu,a«a.*eco,u™sa.co.<.^acco*g«.*e 
^ ^<,r.ALUB"."COUKr.-00..,o.-NUMEWC..«KS,«»«»«esa 

this is as follows: 
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call colorNumericColuinn(colletter. ncolors, 
break. Color, StartColoringRow, FinishColoringRow) 
Case "TEXT" 

call ColorTextColuinn(colletter, ncolors, break, 
5 color, StartColoringRow, FinishColoringRow) 

End select 

Then, when the rule in control panel 144 is selected for execution, the rule 
is applied to the selected colunm(s) (denoted in cell 148) of the named sheet (in 
cell 146). For each colunm in the named sheet, the value in each cell is compared 
to the various breakpoints and the cell is colored corresponding to the appropriate 
breal^oint. 

Examples of the appUcation of various coloring rules in the "Color 
COKTROL" worksheet of FIGURES 4A-4B, are shown with reference to the data in 
15 worksheet "DEMO 2" (depicted in Figures 9A, 9B, lOA and lOB). 

rnlnr-Mediated Data MiBing 

As noted above with reference to Figure 2. once the data have been 
colored according to the user-selected rules (at 126), the user can then perform 
color-mediated data mining (at 128). The presently preferred embodiment of this 
invention provides five mechanisms (each discussed below) for color-mediated 
data mining, namely mechanisms to: 

1 . enlarge/shrink cluster starts; 

2. vertically re-scale the display; 

3 . score and sort clusters; and 

25 4. score and sort dose-response data. 



28 



• tf^ PCTAJSOO/20401 

The following discussion refers to the cluster control worksheet which is 
shown in Figures IIA-IIB. 



1. Rnlaree/ShrinW Tluster Starts 

The "Enlarge Cluster Starts" mechanism highlights the first row of each 
cluster in clustered data by enlarging the font of the ceU containing the cluster 
number or label, thus enabling size reduction of the spreadsheet for the user to 
focus on the color patterns. When the ceU height is dramatically reduced in order 
to see more cells on a screen or printed page, this enlargement allows the user to 
still read the label at the beginning of each cluster. The mechanism takes user 
input from a Cluster Control worksheet. A corresponding mechanism ("SHRINK 
CLUSim STARTS") allows for undoing the enlarging. This mechanism handles 
cluster numbers or textual labels. Any column can be designated as the cluster 
labels to be processed. 

Operation of the mechanism is as follows: 

(1) From the "Cluster Control" sheet 314 pick a control panel to 
use (one which is empty or one containing inputs no longer 
needed). On this sheet, a single control panel extends vertically 
through the black, blue, red, and green sections, and provides input 
information for several tools. 

(2) In the blue section, enter a sheet name and the column to be 
considered as the cluster labels. 

(3) Click either the blue-text "Enlarge Cluster Starts" or "Shrink 
Cluster Starts" button. 
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The program code accomplishes this by scamiing the column of cluster 
labels, identifying any entries that are different from the one immediately above, 
and enlarging them. Code that carries out this function is shown below: 

For Each c In Range (Cells (3 , colnum) , Cells (lastrow, 
colnum) ) 

irow = c.Row - 1 

icol = c • Column 

If c. Value <> Cells (irow, icol) .Value Then 

c- Font. Size = bigfontsize 
' Rows (Irow + 1) .RowHeight = higrowh&±ght 

End If 
Next c 

Example 

An example of the application of the enlarge cluster mechanism is shown 
in Figure 12 which shows the application of a rule (shown in the control panel 
Figure IIC) from the cluster control worksheet in Figure IIB to the data of 
worksheet "DEMO 2" as shown after coloring in Figures lOA-lOB. As shown 
in Figure IIC, the rale is to be applied to column B of sheet "DEMO 2". 



2. Vertical Display Re-Scaling 

The vertical re-scaling mechanism operates by taking a user-provided scale 
factor and then changing height of data rows to facilitate visualization of large- 
scale color patterns. The mechanism leaves column heading heights and column 
widths unchanged. This makes headings remain readable and greatly simplifies 
examining long columns of data for color patterns. 
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Figures 13A-13D show the application of the vertical display re-scale 
mechanism according to the present invention. Figures 13A-13B show some of 
the data m the worksheet labeled "DEMO 3" 306 (Figure 13A shows the first 
thirty eight or so elements and Figure 13B shows the remaining elements of that 
worksheet). As can be seen from the figures, the worksheet "DEMO 3" 306 has 
three hundred and twenty eight (328) data entries (in rows 2-329). The user can 
vertically scale the display by selecting "Re-scale Vertical" from the system's 
special menu or by pressing a particular control key sequence (e.g., "CNTL- 
SHIFT-V" in a preferred embodiment). This presents the user with a dietlog box 
318, as shown in Figure 13C, which asks the user to enter a scaling factor 
relative to the current size. The user enters a scaling factor to enlarge or reduce or 
restore the display. In the example shown, the user enters a scaling factor of 0. 1 
which produces the vertically scaled display shown in Figure 13D. 

Vertical scaling allows a user to get an overview of the data, based on the 
coloring. 

The portion of the program code presented below carries out the central 
function of the vertical display rescaling mechanism: 

rowspec = "2:" & lastrow ' leaves the headings unchanged, 

i.e., readable 

If factor = -1 Then 

Rows ( rowspec ) . Rows . AutoFi t 

Else 

For irow = 2 To lastrow 

Rows (i row) .RowHeight = 

Rows ( irow) , RowHeight * factor 

Next irow 
End If 
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After execution of the rescaling mechanism, as can be seen in 
FiGUR£ 13D, the height of each row (except the heading rows) has been scaled by 
factor y 0. 1 in the example shown. In this manner, all rows of the data are made 
visible on a single page, thereby facilitating data analysis. 

3. Scoring and Sorting Clusters 

Scoring and sorting clusters assign numerical scores to the color patterns 
of individual rows or clusters of rows, thereby enabling comparison and sorting of 
the clusters by score. 

The scoring and sorting mechanism accepts user's designations of colors 
and corresponding relative scores. It handles cluster niunbers or textual labels. 
Any column can be designated as the cluster labels to be processed. The 
mechanism scores a user-selected list of columns of data, with user-assigned 
relative weights, which need not be equal for all colimms. 

User input is taken from a Cluster Control worksheet 314 (see 
Figures 11 A and IIB), which stores any number of parameter sets, each one 
with a user-specified name. 

The input data is automatically sorted by cluster label before starting, in 
order to group the clusters together in case the user has previously sorted the data 
by some other criterion. Then scores are normalized to remove the effects of 
cluster size, absolute magnitude of scoring points chosen, and absolute size of 
weights chosen. The results are written to two new worksheets without altering 
the original data sheet. The first derived sheet is for the nimierical scores; the 
second is like the original, but has the clusters sorted into descending score order, 
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SO that the "best" are at the top, removing the need to visually scan a long colored 
worksheet. The derived output sheets have names that indicate their source data 
sheet and the name of the parameter set used for scoring. At the user's option, the 
system reversibly hides the un-scored colimms in the cluster-sorted output sheet, 
focusing attention on the data that were used in scoring. 

In preferred embodiments, the system detects uncolored cells in the data 
and offers the user two programmed modes of dealing with them, (uncolored = 
entry on user's list of scores or uncolored = "average of other colors in row'')? or 
the option of stopping to color them manually. 

If the user designates a coliunn of individual compound labels as the 
"cluster labels," then the system compares single compounds rather than clusters. 

The mechanism operates as follows, with reference to Figures IIA-IIC. 

(1) On the "Cluster Control" sheet 314 the user picks a control panel (e.g., 
panel 1100) to use (a panel which is empty or one containing non- 
needed inputs). On this sheet, a single control panel extends vertically 
through the black, blue, red, and green sections, and provides input 
information for several tools. 

(2) In the top black section 1102 of the selected control panel 1100, the 
user gives this new parameter set a name if not already done. The 
name will be used to label the outputs. 

(3) In the blue section, the user enters a sheet name (in 1104) and the 
column (in 1106) to be considered as the cluster labels. Note: To score 
each compound separately rather than in clusters, enter a column with 
individual compound labels as the "Cluster CoL" 



33 



wo 01/08039 PCTAJSOO/20401 

(4) The red section of the control panel is divided into two parts, with its 
action button 1108, with red text "Score and Sort Clusters", in the 
middle. Above the button, enter the names of the colors 1110 to be 
assigned point scores, along with their corresponding point scores 

5 1112. The scores are arbitrary and relative; they will be normalized by 

the system as necessary. However, a user should be sure always to 
assign higher point scores to colors which denote favorable values, and 
lower point scores to colors which denote imfavorable values. The 
cells with entries need not be colored, and need not be in score order, 

10 because the system will color and sort these cells when run. 

When assigning point values, a user should be aware that 
uncolored cells (which are most likely blank, i.e., unknown data) may 
have quality values above or below those that contain grouped and 
colored data.. The user may decide that some of the colored groups 

15 are "better" or "worse" than data being unknown, and can assign a 

score to the color ^'NONE" accordingly. 

(5) Below the "Score and Sort Cluster" button 1108, the user enters the 
columns 1114 to use for scoring (using the same syntax as for the Data 
Coloring) and their corresponding relative weights 1116. The numbers 

20 for weights are arbitrary and relative; they will be scaled by the system 

as necessary. Note that a line with multiple coliunns will assign the 
entered weight to each of the columns. 

(6) The user the selects (clicks) the red-text "Score and Sort Clusters" 
button 1108. 
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(7) When the scoring and sorting tool runs (on the system 100), if the 
system detects uncolored cells in the data, the user will be offered two 
modes of dealing with them automatically, or a third manual option of 
stopping to color them. The two modes are: 

• "Use score for the color "none" on my list" 
(RECOMMENDED) 

• "uncolored = average of other colors in row". 

(8) The program then scans the chosen columns in each row and adds up 
the chosen colimm's color scores for that row. These scores are then 
averaged for each cluster of rows, as defined by the user-selected 
"cluster column." The VB program code which accomplishes this is as 
follows: 

For icol = 1 To ncols 
colorcode = 

Cells (irow,colnum (icol) ) . Interior . Colorlndex 
color foimd = False 
' Add up the weighted scores 
For j =s 1 To ncolors 

If (icolor(j) = colorcode) Then 
j score = j 
colorfound = True 
Exit For 
End If 
Next j 

If colorcode = xlNone And treatblanks = "AVERAGED" 
Then colorfound = True 

If Not .colorfound Then 

Cells (irow, colnum(icol) ) .Select 
If colorcode = xINone Then 
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thiscname = " none " 

Else 

Call ColorNameToIndex (thiscname, colorcode. True) 
End If 

addscore = score (jscore) * colweight (icol) 
cmpdscore = cmpdscore + addscore 
' Next IF-THEN- ELSE hlock is 

' special calculations for the "averaged" mode 

If color code = xlNone Then 

lostweight = lostweight + colweight (icol) 
minscore = Application.Min(minscore, 0) 
maxscore = Application.Max(maxscore, 0) 

Else 

cmpdscore2 - cnTpdscore2 + addscore 
End If 
Next icol 



( 9 ) The scores are then normalized for the various cluster sizes (number 
of rows per cluster), and scaled to a value of one hundred (100) for a 
row which is colored entirely with the user's highest-scoring color and 
a value of zero for a row that is colored entirely with any color to 
which the user has assigned a score of zero. 

If clusterscore (icluster) = 0 Then 

' do nothing^ 
Elself clusterscore (icluster) > 0 Then 
clusterscore (icluster) = 

100 * clusterscore (icluster) / (nrows * maxscore) 
Elself clusterscore (icluster) < 0 Then 
clusterscore (icluster) = 

100 * clusterscore (icluster) / 
(nrows * (-minscore)) 

End If 
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(10)The results are presented as two newly inserted worksheets. The first 
is named by appending the word "SCORES" to the name of the 
origmal data sheet, and contains a list of the clusters with their sizes 
and scores. 

(1 l)The second new sheet is named by appending the word "SORTED" to 
the name of the original data sheet. The "SORTED" sheet contains a 
copy of all the original data and coloring, but with the rows re-ordered 
to place the highest-scoring clusters at the top, and all the clusters in 
descending score order from there down. 

(12)The user has two additional options regarding the appearance of the 
"SORTED" sheet: (a) a colimin contsdning the numerical scores can be 
added; and (b) the columns that were not used in the scoring can be 
hidden, so that only the ones actually used remain visible. 

An example of user provided data is shown in the control panel in 
Figure IID which is taken from the cluster control worksheet shown in 
Figure 11 A. As shown in Figure 11 A, the parameters are stored with the name 
"Cmpd" 1102. The scoring a sorting parameters in the control panel 1100 of 
Figure IID give the color red a score of "-1", orange has a score of "0", yellow 
has a score of"!" and light green has a score of *'2". Columns C and D have 
relative weights of "1", as does colunm E. 

Note on the output of score and sort clusters: The system inserts two new 

sheets after the data (see, e.g.. Figures 14C-14D). The first added sheet contains 

two score columns: the scores generated by both of the auto modes (uncolored — 

zero and uncolored = average), but the one not selected will be gray. The scores 
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are on a scale of "-100" to "+100", where a score of "-100" means that all cells 
had the maximally negative score available, and a score of "+100" means that all 
cells had the maximally positive score available. The second added sheet has 
clusters sorted according to the one auto mode chosen when the tool ran. The 
routine offers to hide all columns that were not used in the scoring and sorting. 
The user can selectively xmhide certain columns by using the "EditiGoTo" menu 
option (or typing "CTRL-G' •)» enter the columns in the "Reference" box (for 
example, C:F), then pick the "Format: Column:Unhide" menu option. 

If the user wants to see a color-score-sorted list of compounds within a 
particular cluster (such as the best cluster), the user should do the following: 

1 . Sort by clusters to find the ID of the cluster wanted. 

2. With a second rule, sort by compoimds. 

3. Go to the "SORTED by Compoimd" results sheet and tum on Excel's 
"Data:Filter:AutoFilter" feature for the column that specified the 
clustering in the first sort. The user can then choose to view only the 
compounds in one particular cluster, and they will be in compoimd- 
sorted order. 

Example 

With reference to the already-colored worksheet "DEMO 1" shown in 
Figures 8A-8B, the cluster control worksheet shown in Figure 11 A, and the 
control panel shown in Figure IID, application of the scoring and sorting of 
clusters is described. As noted above, in the control panel of Figure IID, the 
parameters are stored with the name "Cmpd" 1102. The color red has a score of 
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orange has a score of "0", yellow has a score of "1" and lightgreen has a 
score of "2". Columns C and D have relative weights of "1", as does column E. 

Application of control panel "Crnpd" of FIGURE IID. by selecting "Score 
and Sort Clusters", produces the worksheets shown in Figures 14A-14B. When 
the user selects the "Score and Sort Clusters" button 1108 for the "Cmpd" control 
panel of FIGURE IID, the system first presents a dialog box (1402 shown in 
Figure 14A) asking the user how un-colored cells should be scored for sorting. 
As noted above, un-colored cells can be scored explicitly by user entries 
(recommended) or as the average of the colors in the same row. Once the user 
, makes a selection and clicks on the "OK" button, the system scores and sorts the 
data, producing the display screen shown in Figure 14B. The system provides a 
summary of what was done, including the information about the two new sheets 
("DEMO 1 SCORES by Cmpd" and "DEMO 1 SORTED by Cmpd un=ze") 
which are added to the workbook. Figures 14C-14D show the data in the newly 
5 created worksheet "DEMO 1 SCORES by Cmpd". 

Example 

With reference to the already-colored worksheet "DEMO 2" shown in 
Figures lOA-lOB, the cluster control worksheet shown in Figure 11 A, and the 
control panel shown in Figure IIC, appUcation of the scoring and sorting of 
clusters is described. In the control panel of Figure IIC, the parameters are 
stored with the name "acids" (1102). The color red has a score of "0", orange has 
a score of "1", yellow has a score of "2" and light green has a score of "3". 
Column D has a relative weight of"!". 
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The application of the parameters or rules in the "acids" control panel 
produces two new worksheets ("DEMO 2 SORTED by acids" and "DEMO 2 
SCORES by acids") shown in Figures 15A-15B. 



4. Score and Sort Dnse-Resoonse Data, 
Data grouping and visualized by color coding has also been found to 
enable an automated solution to another vexing pattern recognition problem. An 
HTS lab is currently able to provide dose-response data on some subset of the 
whole collection of compounds originally tested. Sometimes, logistical 
constraints (time and/or cost) dictate that only a few concentration points can be 
run on each compound, and the high-throughput nature of the process generates 
somewhat noisy data. A similar situation somethnes exists in other biological 
laboratories where assays are very time-consuming. Dose-response curves with 
few. noisy points are difficult to analyze by traditional curve-fitting methods. The 
present invention includes a mechanisms/algorithms for analyzing percent-of- 
maximal-effect data and accurately ordering the compounds by potency, even 
when faced with few points and high noise. 

The mechanism recognizes two properties of the dose-response data for 

each compound: 

1 . "Dose-responsiveness," the drop-off of activity with dilution, is taken 
as a sign that the compound has some reasonable pharmacological 
mechanism of action. 
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2. The activity measurements at the various concentrations also provide a 
confirmation of the general level of each compound's activity that was 
indicated by tiie original single-poke HTS hit. 
These two properties are somewhat independent, as illustrated by the 
example of a compound that is 95% active at aU tested concentrations. It 
demonstrates very poor (i.e., no) dose-responsiveness over the range of 
concentrations tested, but is so active tiiat it should not be ignored, because it 
might reveal a dose response if tested at even lower concentiations. 

By using the data groupings and color codes of tiie dose-dependent activity 
data columns, which help to smooth out excessively fine distinctions in tiie 
numbers, tiiis invention includes an algoritiun to assign numerical scores for dose- 
responsiveness and overall activity in tiie dose-response data. Moreover, tiie 
algoritimi also calculates a smart composite of tiiese two scores, in such a way tiiat 
a highly active compound will get ahigh composite score even if its dose- 
responsiveness is poor. This composite score is capable of extiracting useful 
information, even from very noisy data, and has been validated to correctiy order a 
list of test compounds. The system of tiiis invention adds data columns tiiat leport 
all tiiree scores for each compound, and tiiese columns can tiiemselves be color 
coded, and tiius used in further comparison to otiier types of data by compound or 
cluster scoring and sorting as described above. 

Moreover, witiun certain limits, tiie invention's dose-response scoring 
algoritiim can also be used to make quantitative estimates of IC50 values of 
compounds, even in tiie presence of large amounts of experimental error. This is 
accomplished by adding a set of hypotiietical m^er compounds witii known 

potencies and tiieoretically calculated activities at tiie test concentrations. Since 
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the ordering algorithm is reliable, these markers will be ordered into their 
appropriate place, and can be used to calibrate the ordering scores in terms of 
actual ICso's. In other words, estimates of ICso for the compounds can be 
generated by interpolating between the markers in the ordered list of composite 
scores. 

Scoring and sorting dose-response data according to the present invention 
processes several columns of colored dose-response data (activity vs. 
concentration) to assign three numerical scores that can later also be color coded, 
and thus used by the "Score and Sort Clusters" mechanism (described above) to 
compare compounds or clusters of compounds. The three scores are: 

(a) degree of dose-responsiveness over the concentration 
range tested; 

(b) overall activity level demonstrated in the dose-response 
data; and 

5 (c) a variably weighted composite of (a) and (b), designed 

to give high scores for high activity even when dose- 
responsiveness is poor (e.g., a compound that is highly 
active at all concentrations). 
The scoring and sorting dose-response data according to the present 
20 invention bases its scoring on colors rather than absolute activity numbers. The 
mechanism takes user input from a Cluster Control worksheet, e.g., as shown in 
Figures IIA-IIB. Figure IIB shows a control panel from the cluster control 
worksheet shown in Figure llA, wherein the user has selected columns F to I of 
worksheet "DEMO 1" for scoring dose-response. 
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The system detects uncolored data, notifies the user, and asks whether to 
continue. If yes, the system skips the row containing the uncolored data. The 
system inserts three new columns on the original spreadsheet to contain the new 
scores, the new colunms immediately following the columns of dose-response 
data. The column headings show the name of the parameter set used for scoring. 
Preferably, the system offers to regenerate existing table of Sheet Statistics to 
correct it for newly added score columns. Further, the system offers to sort the 
data rows by decreasing score. The system also offers to carry out quantitative 
estimates of IC50 values for the user's compoimds, by adding artificial calculated 
calibration marker compounds. 

In order to score and sort dose-response data: 

(1) Ensure that the dose-dependent activity data columns are ordered 
with highest concentration at the left and lowest concentration at 
the right. To ensure this, the system will remind the user of this 
requirement and ask him to confirm it when this tool is run. Note\ 
If the data are for an undesired effect such as toxicity, the colimms 
should be ordered the opposite way (lowest concentration left, 
highest right). 

(2) Use the Data Coloring (described above) to color the dose-response 
data colunms. 

(3) Go to the Cluster Control sheet 314 (Figures IIA-IID) and pick 
a control panel 1100 to use. On this sheet, a single control panel 
extends vertically through the black, blue, red, and green sections, 
and provides input information for several tools. 
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(4) In the top black section, give this new parameter set a name (1102) 
if not already done. The name will be used to label the outputs. 

(5) In the blue section, enter a sheet name (1104). 

(6) In the red section (1110), enter the colors used to color the dose- 
dependent data, and relative point scores (1112) to be assigned to 
these colors. 

(7) In the green section (1118)^ enter the columns which contain the 
dose-response data (using the same syntax as for Data Coloring). 

(8) Click the green-text "Score Dose-Response" button (1120). 

(9) If the data are expressed as "percent of maximal effect," the user 
can follow the prompts to add calibration markers and make 
quantitative estimates of ICso's. 



Note on the output of score and sort dose-response: the system inserts 
three score columns after the dose-dependent data. The three scores are all scaled 
to a 0-1 GO range, and have meanings as follows: 

(a) degree of dose-responsiveness over the concentration range 
tested: 

100 = smoothly decreasing with dilution, spanning the 

entire range of color groups; 
75 = flat dose-response; and 
<75 = even more poorly behaved 

(b) overall activity level demonstrated in the dose-response data 

100 = highest activity color group at all concentrations. 
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(c) a variably weighted composite of (a) and (b), designed to give 
high scores for high activity even when dose-responsiveness is 
poor (e.g. a compound that is highly active at all 
concentrations). 

The Dose-Responsiveness Scoring Algorithm 

The data colimms are ordered left to right, by decreasing concentration. 
The scoring algorithm awards positive scoie points for each dilution step across 
the data that actually shows a decrease in the activity data group (i.e., the color), 
and to penalize every step that does not. The algorithm uses the following 
scoring: 

• +1 point (awarded) when a dilution step moves to a lower activity group 

• 0 points when a dilution step leaves the activity group unchanged 

• -3 points (penalty) when a dilution step moves to a higher activity group 
The maximum score would go to a compound that shows all the possible 

color group steps in the right direction, and has no reversals. The minimum score 
would go to a compound with all the possible reversals, and no correct steps. The 
program then scales the extremes to 100 and 0, in order to present a consistent 
interface to the user. 

The relative magnitudes of the scoring parameters were empirically arrived 
at by testing "complete sets" of color patterns. This is possible because of the data 
simplification afforded by the value grouping. If we define the following 
numerical parameters: 

C = number of colors used, i.e., number of data value groups 
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P s number of points measured, i.e., number of different concentrations 
(doses) tested, 

then the entire "universe" of possible color patterns includes (C**) different cases. 
For some typical values that might be encountered in real HTS data, this total 
number of cases is manageable in Excel, as shown by Tabl£ 1, below. 

Table 1> Total Number of Color Patterns 



p = 


C = 


total number 


# of cone. 


# of color groups 


of 


Points 




possible cases 


3 


3 


27 


3 


4 


64 


4 


3 


81 


4 


4 


256 


5 


3 


243 


5 


4 


1024 


6 


3 


729 


6 


4 


4096 


7 


3 


2187 


7 


4 


16384* 



For a spreadsheet with a heading row, this exceeds Excel' s current limit (for 
EXCEL 95) by one. This value should not exceed the limit for Excel 97. 

Scoring was done on several of these complete sets, hi each set, the results 
were sorted by decreasing score and compared to "intuition" for general 
correctness of ordering of dose-responsiveness, and scanned for cases deemed to 
be clearly out of order. The (+1, -3) score set was foimd to produce satisfying 
ordering, while lesser penalties led to poorly ordered results. More objective tests 
of ordering (described below) were then used to fiirther validate the algorithm 

The case of P=3 and C=3 is presented below in its entirety for illustration. 
Table 2 (Figure 16F) shows artificial data and processing for twenty seven (27) 
hypothetical compounds. The "percent inhibition" columns represent assay 
"data." If one defines three groups by breakpoints at 33% and 66%, each cell is 
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assigned to ^ data group as shown in the middle set of three columns. Here it is 
clear that the order of compounds in this table is systematic (111, 112, etc.), to 
illustrate that the complete set is present. The third set of three colunms shows 
color coding, with the darkest being least active and the lightest being most active. 

Then the data set was processed by the system to yield dose- 
responsiveness scores, and the results sorted by this score, giving Table 3 
(Figure 16G), the complete set in order of decreasing dose-responsiveness. 
Table 3 also shows the intermediate step-scoring and unsealed score points, to 
aid in following and understanding the algorithm. These points are not displayed 
by the system itself. 



The Overall-Activity Scorinp Algorithm 

The second property of interest to be extracted from the data is the overall 
activity level exhibited by each compound. As explained above, this is largely 
independent of the dose-responsiveness. 

The data value groups' ordinal index numbers are used as single-point 
activity measures instead of the original data numbers. Extra weight is given to 
activity shown at lower concentrations by the simple algorithm of weighting each 
data column by its serial position, again ignoring the actual concentration values. 
The scores are then scaled to the range 0 to 100. The results of this scoring on the 
same complete set are shown in Table 4 (Figure 16H) which has been re-sorted 
by decreasing overall activity. 
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The Composite Scoring Algorithm 

Comparison of Tables 3 and 4 (Figures 16G & 16H) shows clearly that 
the compound ordering by dose-responsiveness is quite different from the ordering 
by overall activity. The user (a chemist) could now color-code the new score 
columns and use them as independent factors in a larger scoring. However, 
chemists also want a single index of compound quality derived from the dose- 
dependent data. Moreover, a composite index would further help to alleviate the 
effects of noise on data interpretation, by incorporating more information into the 
ordering process. This is an "information-based smoothing'' of the data. 
Therefore, a procedure to calculate a third, "smart composite" score from the other 
two scores was devised. 

The general idea is that when selecting good compounds from dose- 
response data, compounds showing overall high activity should not be discarded 
for lack of responsiveness. Therefore, the smart composite score should give 
more weight to the overall activity when the overall activity is high, but lower 
weight when it is low. A generalized weighted average is written as 

composite score = (activity weight) (activity score) + 

(responsiveness weight) (responsiveness score) 
or, defining corresponding symbols: 

Sc = (Wa)(Sa) + (Wr)(Sr) 
If the weights are normalized to sum to unity, then ihis becomes 

Sc = (Wa)(Sa) + (1-Wa)(Sr) 

The activity weight Wa varies with the activity score Sa in such a way as 
to achieve the desired result. 
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The functional form of this variation was the subject of empirical testing. 
It was decided that the limits would be that Wa would approach 0,5 (activity and 
responsiveness equally weighted) in the limit of low activity, and that Wa would 
approach 1.0 (responsiveness ignored) in the limit of high activity. The actual 
variation was encoded as an exponential increase in order to have rather sharp 
onset of the activity bias at higher activities: 

WA = (Ci)exp[(k)(SA)] + C2 

The value of the coefScient k=0.06, for which the activity bias starts to 
become substantial around an activity score of eighty (80), was chosen for 
implementation in a preferred embodiment of this invention, according to 
empirical results. Figure 161 shows the variation for a few values of k. Table 5 
(Figure 16 J) shows all three scores for the example complete set, now sorted by 
decreasing composite score. 

The details of the scoring algorithms were arrived at largely by comparing 
results to intuitive ordering of the test cases in the complete sets. Because the sets 
were complete, no really new results can be generated by further test sets. 
However, one can generate test activity data sets from compounds of known 
potencies, whose real rank ordering is thus known, in order to see more 
objectively how well the scoring algorithms rank the results. 

To this end, a set of pseudo-ligands was hypothesized, with dissociation 

constants from a fictitious receptor ranging from nanomolar to millimolar (pK = 9 

to 3). The set included thirty one (31) compounds, with potencies evenly spaced 

by 0.2 log imits (9.0, 8.8, 8,6, ... , 3.4, 3.2, 3.0). A "pseudo-screen" was created 

which ^^ested" binding of these ligands at five concentration points in the usual 
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range: 10'^ M, 3x10"^ M, 10"^ M, 3x10'^ M, and 10"^ M. Note that the span of 
potencies exceeds the span of concentrations tested by two log units on each end, 
so the test set includes both "very active" and "very inactive" compounds relative 
to the screening concentrations. 

Then artificial binding data were created by calculation as follows. 
Assuming a simple binding equilibrium of the ligand to a receptor, the "percent 
inhibition" at a given ligand concentration is equal to the fraction of receptor sites 
which are occupied by ligand, given by simple equilibrium equations as 

Pinhib = 100 • Gigand) / [K + (ligand)] 

For a more realistic simulation, artificial random noise was then added to 
the calculated numbers. The first experiment reported below used noise randomly 
distributed over the range of ±10 inhibition percentage points, and the second 
with noise up to +30 inhibition percentage points. Note that this means ten or 
thirty percentage points of absolute error, not 10% or 30% of the value. 

The artificial data were then color-coded according to the mechanisms of 

this invention (described above) into four color ^oups, using the simple 

breakpoints at 25, 50 and 75 percent inhibition. Note that in assigning these 

brea]q>oints, no consideration was given to the actual data values. Then the 

scoring algorithms of this invention were run, and the compounds sorted by the 

composite score. Rank order numbers were assigned to the compounds, with 1 

being the most potent and 3 1 the least. In cases of ties in the composite score, 

equal rank nimibers were assigned, with a value equal to the average of the rank 

numbers spanned by the tied group of compounds (e.g., a tie for 2nd and 3rd 
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would result in each compound being ranked "2.5"). For each experiment, the 
final rankings were plotted against the "real" rankings by known potency, to test 
how well the scoring algorithms ordered the compounds. These plots are shown 
in Figure 16K (for noise=10) and Figure 16L (for noise=30)- 

For the experiment with noise up to 10 inhibition percentage points, shown 
in Figure 16K, the ranking of the composite scores is "perfect" (in the sense of 
having no inversions) over the range of tested concentrations (pK = 5 to 7). The 
pseudo-screen is imable to distinguish the potencies of compounds above or below 
this range. 

When the noise is much higher (30 percentage points), the ranking of 
individual compounds is not as precise, but one can identify three cleanly divided 
"good-mediimi-bad" groups, as indicated by the dashed boxes on Figure 16L. 
Thus, even with this rather extreme noise level, the invention's scoring still 
successfully prioritizes the compounds into groups. The range where 
discrimination is effective is still roughly the range of the test concentrations (pK 
= S to 7), but has been reduced somewhat by the higher noise. Note that the 
ranking within this range (the middle boxed group) is still mostly correct, with 
only one inversion, even for single compounds. 

Quantitative Estimation of Potencies 

With confidence established that the algorithms provide reliable rankings 
of compounds by potency, it is possible to proceed to making quantitative 
estimates. The method uses calibration marker compounds. 

To understand this method, it is helpful to realize that the concept is 
analogous to the quantitative use of SDS polyacrylamide electrophoresis gels to 
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measure protein molecular weights. The proteins are known to migrate through 
the gel with speeds directly dependent on molecular weight, but it is difficult to 
calculate the absolute migration rates for a particular experiment. In dose- 
response scoring, the compounds are known to be properly ordered, but it is not 
clear how to calculate a potency (e.g., Kdiss or IC50) directly from the score. 

Protein chemists solve the molecular weight problem by running marker 
proteins, with know molecular weights, in the same gel, then using their band 
positions as calibration for the unknowns. Analogously, this invention's 
quantitative estimation method uses hypothetical marker compounds of know 
potency to intemally calibrate the dose-response composite scores for the user's 
choice of a coloring rule, then interpolates the potencies of the unknowns. 

To create markers, the system asks the user to input the concentrations 
used for each of the dose-dependent activity data columns. The system then picks 
a set of calibration concentrations, at intervals of O.S log units, to span the tested 
range. For each of these calibration concentrations, a marker compound is created 
and added to the user's compoimd list, and artificial data is calculated for each 
column, from the same simple equilibrium binding equation used above in the 
validation study (this time with no "noise"): 



The marker data are then colored by the same rule used for the user's 
compounds, and the scoring and sorting algorithm is re-run. 

The result is that the markers are sorted into the list according to their 
potencies, and the potencies of the other compounds can be estimated by 



Pinhib = 100 • (ligand) / [K + (ligand)] 
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interpolating between the markers, using the composite dose-response scores. To 
illustrate, a typical section of a sorted list is shown below in Table 6 
(Figure 16M), using four colors. 

Potencies for compoimds that fall between two markers are calculated by 
linear interpolation between the logarithms of the markers. Given the various 
uncertainties in the data values themselves and in the evaluation process, it was 
found that linear interpolation between markers spaced at 0.5 log unit intervals 
was sufficiently precise, and no more complex curve fitting was necessary. 

Validation of Quantitative Estimation 

Validation of the quantitative estimation method followed a procedxire 
very similar to that used to validate the scoring, and using the same sets of test 
data with various noise levels. As before, the testing concentrations were from 10" 
^ to 10'^ M (negative log from 5 to 7). Marker compounds (no noise) were added 
witfi pK*s from 4.5 to 7.5, and IQiss estimates for the noisy compoimds were 
carried out by the interpolation method. The results are shown below for the cases 
of 10 and 30 inhibition percentage points of noise. 

Figures 16N and 16P show that the estimates are clearly quite good 
within the range of the testing concentrations (pK 5 to 7), but the quality of 
estimation deteriorates quickly beyond those limits, and algorithm does not 
reliably distinguish among compounds whose potencies are more than a half log 
unit beyond the testing range. Therefore, it was decided that presently preferred 
embodiments would not report any estimated values that fell outside the range of 
concentrations used in the testing data columns. Thus, in the example in Table 6 
(Figure 16M), the lowest testing concentration was 10*^ M (= OTl kiM). For the 
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first compound in Table 6, the system has estimated a potency with pICso > 7, 
but it conservatively only reports "<0.1 /M." 

Table 7 summarizes the statistics of the estimations within the testing 
limits. Table 7 shows that the method successfully estimates the potencies within 
about a factor of two, even with high noise levels. 



Table 7. Statistics of Estimation Validations 



percentage 


number of 


average of 


Average error 


points 


compounds 


absflog error) 


in IC50 


of noise 






(factor) 


10 


13 


0.22 


X 1.6 


30 


13 


0.39 


x2.5 



Comparison to Other Methods of Quantitative Estimation 

Further corroboration was obtained by treating some real data from T-cell 
proliferation blockage assays. It is estimated that these data have at least as much 
noise as the artificial test set with 30 inhibition percentage points added. The 
standard treatment of this data in the past has been to fit a dose-response curve 
with a Hill coefficient of 1, using a PC-based program ORIGIN: (ORIGIN is a 
data analysis program from Microcal Software, Inc. of Northampton, 
Massachusetts. ORIGIN is used in this instance for non-linear least-squares 
fitting of dose-response curves to functional equations.) 

The data used here were from testing in the concentration range from 1 to 
0.03 fJM, (negative log from 6 to 7.S). The plot in Figure 16Q shows the 
correlation of values estimated by this invention with values from ORIGIN fits. 
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Two compounds that the present invention estimates to be beyond the testing 
range, i.e., pICso below 6, are included as open diamonds, for illustrative purposes 
explained below. (As explained above, preferred embodiments of this invention 
would nomially not report these values.) 

The results are consistent with the properties observed in the validation 
study. The present invention estimates are quite good within the testing range (6 
to 7.5). At the lower limit, this invention has made two estimations exactly at 10"^ 
M (arrows) which do not correlate as well with the ORIGIN fits. Nevertheless, 
because the whole plot spans only a relatively narrow range of potency, even these 
discrepancies are not very large. For all twenty eight (28) estimates within the 
testing range (including these two), the average logarithmic deviation between the 
two estimation methods is 0.18, corresponding to a factor of only l.S. 

It is further noted that the calibration marker estimation method does not 
uniformly "flatten out" beyond the testing concentration range. The two open 
diamonds in Figure 16Q are estimations that presently preferred embodiments of 
this invention would not normally report because they have pICso < 6, but they 
agree well with ORIGIN fits 

It is interesting to compare calibration-marker with curve-fitting results for 
particularly badly behaved data, such as dose-response curves that are not 
monotonic with respect to concentration. This is sometimes the type of data that 
emerges from dose-dependent screening in a high-throughput mode. Studies of 
this type have been initiated by adding artificial noise to the extent of fifty (50) 
inhibition percentage points. 

Finally, it should be pointed out that there is some mechanical advantage 

to using the present invention relative to current practice of using ORIGIN. 
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ORIGIN is used by manually filling in a template with data, then manually 
executing a fit. Depending on the number of points and the degree of 
customization of parameters, this can take one to ten minutes of the user's time. 
The present invention, on the other hand, processes a whole spreadsheet at once 
(i.e., up to 16,383 compounds), and goes at a rate of about 3,000-4,000 
compoimds per minute on a 200 MHz PC. 



Examples 

Figure 16A shows dose response data for twenty (20) compounds at four 
concentrations. The data have been grouped and the cells colored by the rule 
shown in Figure 16B. The result of the scoring and sorting process is shown in 
Figure 16C, where the compounds are ordered by decreasing values of the 
composite score (column H). Then, virtual "marker'^ compoimds are added with 
known potencies spaced by O.S log imits, and they are shown in Figure 16D, 
colored by the same rule and scored. The name of each marker compound 
designates the logarithm of its potency, e.g., "marker_7.0" has a potency IC50 = 
10'^ M. Figure 16E shows the result of sorting the list by decreasing composite 
score after adding the markers. This process then enables estimation of IC50 
values for the compounds by interpolating in the column (H) of ordered composite 
scores, and these estimates appear in two forms in columns I and J. 

6. Summarize Spreadsheet Statistics Mechanism 

This mechanism creates a table summarizing the entries in each column of 

a data sheet, to aid the user in deciding how to color each column. The 
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mechanism counts numeric, text, and data entries, and uses color to flag columns 
that have mixed types. The mechanism also counts blanks, and specially flags 
coliunns with 'trailing blanks," i.e., colunms which are shorter than the longest 
one on the spreadsheet. For numeric data, the mechanism calculates minimum, 
maximum, mean, and standard deviation, even in the presence of interspersed text 
entries. For text data, the mechanism presents a list of the text strings used and 
their occurrence counts. The mechanism creates a summary key of the column 
letters and headings as a text box that can be copied to other sheets for convenient 
reference. 

Figure 17A shows a sample spreadsheet containing miscellaneous data 
on twenty four (24) compounds. Figure 17B is the statistics sheet calculated 
from it. Each row of the statistics sheet describes one column of the original data 
sheet. First, the counts of numeric, text, date, and blank entries are listed, 
followed by two columns describing the total length of the data sheet. Then the 
minimum, maximum, mean, and standard deviation of any numeric data are 
reported. Finally, the statistics sheet lists a siunmary of the text strings foimd in 
each original data column. As examples, in Figure 17B, one can see that 
original colunm A ("Cmpd") had twenty four (24) different text strings, that the 
numeric data in original colunm C ("Testl") had a mean of 2.385, and, flagged by 
the red coloring, that original data column E ("Test3") had a mixture of ten 
numeric data and two text strings, both ''N.A." 

The details of how the program code accomplishes this are 
straightforward, and one of ordinary skill in the art would know, from this 
description, including the Figures, how to make and use this invention. The 
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program loops through all the entries in the column, testing the data type of each, 
and tallying the counts and numerical statistics. 



Spreadsheet Creation and Organization 



The operations of this invention require a considerable amount of user 
input, e.g., to create well-structured spreadsheets, to define and apply diverse 
coloring rules for large numbers of columns, and to use these colors and the user's 
stated scientific priorities to create meaningfully ordered lists of compoxmds or 
clusters. 

The user interface of this invention has been designed to ease this process 
and help the scientist focus on the tasks of formulating and recording clear 
descriptions of the evaluation parameters. Accordingly, this invention provides a 
number of tools and mechanisms to aid in the creation and organization of 
spreadsheets. These tools and mechanisms include: 
Smart Append Column Mechanism 
Merge Data Mechanism 
Data Import Mechanism 
Workbook Navigation Shortcuts 
Conversion of "uM" to ^M and "UU" to ^ 
Delete Pictures Mechanism 
Change Values in Column Mechanism 
Concatenate Values across Columns Mechanism 
Delete Leading Inequality Signs Mechanism 
Delete Derived Sheets Mechanism 
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Smart Append Column 

This mechanism appends new coiimms of data onto an existing 
spreadsheet, matching rows by labels (e.g., compound numbers). The mechanism 
copies all data to a new sheet before doing its work, leaving the original sheets 
unchanged. There is no need for the user to pre-sort any of the data. The 
mechanism provides optional case-sensitive or case-insensitive label matching. 

New rows are added at the bottom when new labels do not match any old 
labels. Rows with missing labels are identified and the system offers to fill them 
by copying previous label. Rows with repeated labels (i.e., replicate data) are also 
identified and the system offers a choice fi*om among several automated 
processing rules, or manual fixing. A fast matching algorithm temporarily sorts 
rows by label, then restores original order when finished. Several intermediate 
stopping points are offered and extra data viewing options for conservative users 
worried about errors. 



Merge Data Mechanism 

The merge data mechanism copies new data values fi*om an appended 
column into an older column. The mechanism copies all data to a new sheet 
before doing its work, leaving the original sheets unchanged. The mechanism 
detects cells where new data would overwrite old data that is different, flags them 
with color, and alerts the user. -Several intermediate stopping points are offered to 
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the user, as are extra data viewing options are offered for conservative users 
worried about errors. 

Data Import 

One-button (or one-menu-click) import of existing Excel spreadsheets 
into an integrated file, which contains both the data and the related control sheets. 
The mechanism offers to search for and remove any leading or trailing spaces in 
the imported data and offers to consolidate replicate data rows into unique ones, 
using user choices as to how to handle the replicate data. The mechanism also 
detects hidden rows and offers to unhide them and detects formulas and offers to 
convert them to values. This mechanism is also used to update to newer version 
of the system. 

Workbook Navigation Shortcuts 

The system includes various workbook navigation shortcuts including: 

• A special added drop-down menu which includes commands for jumping 
directly to the various control sheets. These commands also have keyboard 
shortcuts assigned to them. 

• From a cell on a control sheet that contains the name of a data sheet, a 
special item on the right-click shortcut menu jumps directly to that data 
sheet. Other special items on this menu enable a "Twin Screen" display to 
see two sheets at once. 
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• To aid in choosing columns to enter on control sheets, there is a special 
"Twin Screen" display triggered by right-clicking any cell on a control 
sheet that contains the name of a data sheet. 



Convert "uM" to uM and "UU" to u 

Preferred embodiments of the system of this invention require the data 
spreadsheet to have one and only one row of column headings. The user can type 
either of the encoded strings "uM" (lowercase u, uppercase M) or "UU" (both 
uppercase) into any column heading, select the cell or whole row of headings, then 
pick this command. Each '"uM" in the selection will be converted to "^iM", and 
each "UU" will be converted to a "ji"- Th® recognizes the special exception 
of the word "VACUUM'' as long as it doesn't end with the cases "uM." This 
conversion allows the user to avoid the confusing use of lowercase "u" or the 
column-widening use of the full prefix "micro." This utility appears on the 
system menu. 



Delete Pictures 

The system provides a mechanism for removing pictures containing 
chemical structures, in order to reduce file size, processing time, and confusion 
when they do not align properly after row sorting. 
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Change Values in Column 

This is a mechanism for regularizing data in a spreadsheet column. It 
facilitates replacement of all occurrences of a given value by another. The 
mechanism creates backup copies of the original colvmin, and updates any existing 
data statistics for the edited sheet 



Concatenate Values across Columns 

The system provides a mechanism for regularizing data in a spreadsheet 
colimm. Some possible uses include: (a) construction of unique row labels: M- 
niunber plus stroke number "Ml 23456/001"; and (b) reconstitution of 
numerical inequalities from separate columns: plus a nxmiber ">number'\ 
The user is provided with an option to include linking (delimiting) text strings 
between values and an option to include or skip blanks. The system retains the 
original colxmms and inserts a new one for the results. 

Delete Leading Inequality Signs 

Another mechanism for regularizing data in a spreadsheet column includes 
the mechanism to delete leading inequality signs. This mechanism converts 
entries like ">1000'* to just the number "1000". This must be used with 
considerable caution, because it is the equivalent of creating a false test result. It 
is generally preferable to color the cells containing text strings with the data 
coloring mechanism described above, rather than alter them. All later processing 
is based on the colors, not the cell values. This mechanism also deletes inequality 
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sign only if it is the first character in ttie cell. The mechanism creates backup 
copies of the original colunms. 



Delete Derived Sheets 

The system menu includes a command to delete all output sheets from the 
current workbook, with separate user confirmation for each one. This is intended 
as a cleanup mechanism for infomiation that may be outdated and is easily 
regenerated by subsequent system runs. 

Initial experience with the coloring tool has revealed that color coding has 
more subtle, but far-reaching usefulness. The colors themselves also can act as a 
"currency of exchange," a medium for comparing the quality of one kind of result 
to the quality of a very different kind of result. For example, an HTS activity of 
"95% inhibition"' may be considered desirable and color coded, e.g., green. In the 
same list of compounds, a molecular weight between 400 and 600 may be consid- 
ered optimally desirable, and thus also color-coded green. If the user takes care 
when assigning colors, "green" takes on a common meaning across the board. 
This translation of data values into colors then opens up a cornucopia of 
possibilities for processing the colors (as numerical color indices) and comparing 
compoimds, searching, in our example, for the ones that are the "most green." 

Accordingly, the system includes tools to numerically score individual 
compoimds or clusters of compoimds by the colors that appear in their various 
data columns. The system can then create a new spreadsheet sorted by this score 
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(either by single compounds, or cluster-by-cluster, the choice being the user's), in 
which the "most green" compounds will then appear at the top. 

Examples 

Application to Portfolio Management 

The system can equally well be applied to any set of data where the rows 
are cases of a similar constmct, with the columns being various properties of each 
case. For example, a data spreadsheet can contain a list of competing projects or 
investments for a company's portfolio, with the columns containing various 
managers' ratings of each project or investment. Figure 18A shows an example 
of twenty projects, each of which has been scored 1, 2, or 3 on two factors, one 
more important than the other, by each of three managers. The sheet has been 
colored by the rule shown in Figure 18B. Then the data were scored and sorted 
by the sorting rule of Figure 18C, and the result is shown in Figure 18D. 
Clearly, the projects that were given a "3" in the important factor come to the top, 
and it can be seen that the less important factor does indeed matter less to the final 
ordering. The colors also help to flag anomalies, such as a low score by one 
manager on an otherwise high-ranking project. 

In general, the data can be various sorts of data. Some examples are listed 
below and illustrated in the referenced Figures. 

Figure 19 shows a list of dmg candidate compounds, scored and sorted 
by a composite of ten parameters that describe their physical, chemical, and 
biological properties. Green shades indicated desirable values; red shades are 
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undesirable. The display is compressed vertically with the vertical re-scaling tool 
to clearly display the difference in coloring patterns between the top eighty (80) 
compounds and the bottom eighty (80) compounds (separated in the illustration by 
a blank band). 

Figure 20 shows a list of proteins that are candidates for targets for drugs, 
chosen from a pool of candidate genes, scored and sorted by a composite of eleven 
parameters that describe their suitability. 

Figure 21 shows a list of research projects competing for resources. Each 
project has been scored according to several evaluation factors, and the whole 
array has been sorted by color groups. The same construct is useful for evaluating 
employee performance or job candidates. 

Figure 22 shows a list of pharmaceutical companies and their current 
status with regard to discovering or marketing products in each of various disease 
areas. Each company's line of products has been scored according to the maturity 
of the offerings, and the whole array has been sorted by color groups. 

Figure 23 shows the use of data-grouping (coloring) rules to visualize the 
time courses of drug concentrations in blood. In this example, light colors were 
chosen to represent high concentrations of drug in the blood, while dark colors 
were chosen to represent low concentrations. The figure shows a wide range of 
differing time courses. 

Figure 24 shows the use of data-grouping (coloring) rules to visualize the 
matrix of pairwise cross-correlations of the results of eight (80) drug screens. In 
this example, light colors were chosen to represent low correlations, while dark 
colors represent high correlations. 



65 



wo 01/08039 




PCT/USOO/20401 



Quantitation of the Similarity of Data Grouping in Two Variables 

As part of the present invention, a mechanism is provided for assigning a 
quantitative measure to the degree of similarity of grouping (visualized by color 
coding) of data in each of two columns of an Excel spreadsheet. The mechanism 
allows for a correlation-like analysis on a wide variety of data types, including 
text, or mixed nimibers and text. 

In the data-exploration paradigm of the present invention, one of the first 
steps a user takes is to divide the rsmge of data values in each column into a small 
number of groups for further analysis, thus effecting a reduction of precision 
which has been found to be useful in various ways. 

It is sometimes useful to explore whether the rows of the data matrix have 
been divided into similar groupings in each of two different columns. For 
example, a researcher might ask, "Do the high molecular weight compounds tend 
to be the ones whose solubilities fall below the limits of measurement?'' In other 
words, this would mean to compare the groupings in the molecular weight column 
with the groupings in the solubility column. 

If the data were strictly quantitative, this would be called correlation of 
variables, and there exist a number of perfectly good statistical measures of the 
phenomenon. However, one of the imique capabilities of the present invention 
lies in dealing with textual data and mixtures of nxmibers and text, and it would be 
helpful if one could translate the visible color patterns of the present invention to 
some kind of quantitative measure of correlation. In order to avoid confusion with 
standard statistical correlation, the distinct term "color grouping similarity" is used 
to describe the new measure. 
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In preferred embodiments of this invention, the data grouping is stored in 
the form of the colored backgrounds of data cells. At first glance, one might 
simply seek to compare the colors of the first column with those of the second, 
and count the nximber of rows with matching colors. However, a color grouping 
similarity tool must be able to cope with the possibility that the colors are 
different. This could happen because the user chooses completely different color 
schemes for the two colimins, or because the correlation is negative. As an 
example of negative correlation, suppose column A contains random numbers 
between 0 and 1, colored such that those above 0.5 are green and those below 0.5 
are red. Then imagine a column B where each value is equal to one minus the 
corresponding value in column A, i.e., the "one's complement.'' If the user colors 
the second column with exactly the same coloring rule as the first, every row will 
have a different color in column B than in colunm A. None of the colors will 
match, though the groupings are perfectly correlated. To be successfiil, the tool 
must deliver a high measure of correlation between such pairs of columns. The 
algorithm described below was designed to perform in this way. 

Algorithm for Measuring Data Grouping Similarity 
The algorithm was derived firom semi-quantitative reasoning, as follows. 
It is based on the qualitative question, "For all rows that have one particular color 
in the first column, to what degree do they have a uniform color in the second 
column (not necessarily the same color as in the first colimm)?" The quantified 
answer to this question is then averaged over the set of colors used. 
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The details of the mechanism can be seen by example. First, to compare 
the grouping in two colimms A and B, a matrix of "ordered color pair counts" 
(OCPC) is defmed such that each matrix element OCPC,y is the coimt of 
spreadsheet rows where one finds color / in spreadsheet colimm A and color/ in 
spreadsheet column B. Then, the rows of the two spreadsheet columns are 
scanned to count the number of occurrences of each ordered color pair and thus to 
determine the values of the matrix elements. 

In the discussion below, carefully distinguish the rows and columns of the 
user's spreadsheet of data fi-om the rows and columns of the derived OCPC 
matrix. 

If the two spreadsheet columns had exactly the same colorings the nonzero 
elements of the OCPC matrix would all be on the diagonal. As a simplified 
example, consider four colors (green, yellow, orange, red) and a total of 16 data 
rows. The diagonal matrix might be (zero elements left blank for emphasis) 



color in column A color in column B — ► 


green 


yellow 


orange 


red 


green 


4 








yellow 




2 






orange 






7 




red 








3 



In the case above, there are groups of 4 spreadsheet rows colored green (in 
both columns), 2 rows colored yellow, 7 rows colored orange, and 3 rows colored 
red. 

In contrast, if the groupings were the same, but the coloring rule for the 
second spreadsheet column used the same colors in a different order, the OCPC 
matrix might look like the following, no longer diagonal: 
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color in column A color in column B — > 


green 


yellow 


orange 


red 


green 






4 




yellow 


2 








orange 








7 


red 




3 







A simple extension applies if entirely different colors (cyan, blue, maroon, 
purple) are used in the second spreadsheet column. The OCPC matrix might then 
be: 



color in column A color in column B 


cyan 


blue 


maroon 


purple 


green 






4 




yellow 


2 








orange 








7 


red 




3 







In any of the three cases above, the groupings are identical, and the OCPC 
matrices have the property that each matrix row and matrix colxunn has only one 
nonzero element. That lone element is necessarily equal to the siun of the row or 
column. This situation should receive the highest similarity score. 

One way to define the contrasting situation that would deserves the lowest 
similarity score would be that for each user-defined group of spreadsheet rows in 
one spreadsheet column, the other spreadsheet column has a "maximally non- 
unique" set of colors. In the corresponding OCPC matrix, this corresponds to 
each matrix row or column having a broad distribution of values rather than a 
single non-zero, a uniform distribution has been chosen as the definition of this 
state: 
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color in column A color in column B —> 


cyan 


blue 


maroon 


purple 


green 


1 


1 


1 


1 


yellow 


1 


1 


1 


1 


orange 


1 


1 


1 


1 


red 


1 


1 


1 


1 



This low-similarity state can be more precisely defined by saying that each 
element in a given matrix row or column is the average of all the counts in that 
matrix row or column. 

With these concepts defined and the OCPC matrix filled, the scores can 
then be derived. Each OCPC matrix row (corresponding to a color group in the 
first spreadsheet column) is selected in turn for scoring. Each element in the 
matrix row is given a score between zero and one, according to its linear 
interpolation between: on the one extreme, the average of the nonzero elements in 
the row, and on the other, the sum of the row or column (i.e., the maximum value 
it could have if all the others were zero). The scores are then averaged over all the 
rows of the OCPC matrix to generate a row-wise score component. 

Next, the corresponding process is applied to the columns of the OCPC 
matrix (each corresponding to a color group in the second spreadsheet colimm 
rather than the first). The resulting column-wise score component is averaged 
with the row-wise score component, then the average is scaled to a maximum of 
100 to generate the final similarity score for the two spreadsheet columns. 

Interpretation of the Similarity Scores 

Although the scores are quantitative and well-defined, their interpretation 
is best done in a partly subjective manner, based on experience. The behavior of 
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the scores is best understood by example. In Figure 25, the leftmost ("base") 
column has been compared to each of the others, and the scores are shown as well 
as pictures of the grouping patterns. Comparison of the base with the next column 
shows that the tool delivers a maximal score of 100 for identical grouping, even 
when the colors are completely different. Then, stepping across the figure toward 
the right, it can be seen how the score decreases as the grouping pattern gradually 
becomes less similar to that of the base column. All the way down to a similarity 
score of 40, it is still basically tme that the light colors are on top and the dark on 
the bottom, with increasing "noise," but when the score falls to 20, the pattern 
appears to have no correspondence to that of the base. 



Implementation of the Data Grouping Similarity Tool 

In practice, the tool allows the user to choose two sets of spreadsheet data 
colunms. The program then automatically generates all pairs containing a column 
from the first set with a column from the second set, then writes the similarity^ 
scores onto a newly inserted spreadsheet in the user's workbook. The output takes 
the form of a table where the degree of similarity is itself color-coded to aid the 
user in identifying significant cases. An example appears Figure 26. 



While the invention has been described with reference to particular 
mechanisms (algorithms, processes and functions) and architectures, one skilled in 
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the art would realize that other mechanisms and/or architectures could be used 
while still achieving the invention. 

While embodiments of the present invention have been described with 
particular setup and initialization procedures, other setup and/or initialization 
procedures can be used. 

Further, while many of the operations have been shown as being 
performed in a particular order, one skilled in the art would realize that other 
orders, including some parallelization of operations, are possible and are 
considered to be within the scope of the invention. 

While the present invention has been described with reference to analysis 
and pattern recognition in data sets relating to chemical compounds, the methods, 
systems and devices of this invention are considered to be general constructs 
covering other, non-chemical data sets. 

Thus, are provided methods, systems and devices for analysis and pattern 
recognition in large, multidimensional data sets using low-resolution data 
grouping. One skilled in the art will appreciate that the present invention can be 
practiced by other than the described embodiments, which are presented for 
purposes of illustration and not limitation, and the present invention is limited 
only by the claims that follow. 
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What is claimed is: 

1 . A method of operating on data, the method comprising: 
providing at least one user-defined grouping rule for grouping the data into 

a user-definable number of groups; and 

applying at least one of the grouping rules to the data. 

2. A method as in claim 1 wherein the data are provided in a table and 
wherein the at least one grouping rule applies to at least one user-selectable 
column of the table. 

3. A method as in claim 1 wherein the at least one grouping rule 
defines breakpoints corresponding to the user-definable number of groups, and 
wherein application of the at least one mle to the data divides the data into groups 
based on the breakpoints. 

4. A method as in claim 1 further comprising: 

presenting the grouped data in a manner that visually distinguishes the 

groups. 

5. A method as in claim 4 wherein the grouping rules associate colors 
with groups and wherein the presenting of the grouped data further comprises: 

coloring an aspect of the data according to the rules. 
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6. A method as in claim 4, wherein the data are in labeled columns in 
a spreadsheet, and wherein the at least one grouping rule specifies at least one 
breakpoint and a corresponding color for each at least one breakpoint, and wherein 
the presenting of the grouped data comprises: 

coloring each data item in the at least one labeled colimm of the data based 
on the at least one breakpoint and the corresponding color of the at least one 
breakpoint. 

7. A method as in any one of claims 3 and 6, wherein the breakpoints 
are selected from: (a) numeric values; and (b) textual values. 

8. A method as in claim 3 wherein the at least one breakpoint is 
determined automatically based on the data. 

9. A method as in claim 5 wherein the data are provided in a table, 
wherein the coloring of an aspect of the data comprises: 

coloring backgrounds of table cells according to the rules. 

10. A method as in claim 1 wherein the number of groups is fewer than 
a nimiber of possible data values. 

11. A method of operating on data, the method comprising: 
providing at least one user-defined grouping rule for grouping the data into 

a user-definable number of groups; 



74 



wo 01/08039 




PCT/USOO/20401 



applying at least one of the grouping rules to the data to generate grouped 

data; 

providing at least one user-defined scoring rule for scoring the grouped 
data according to user-defined scores; and 

applying at least one of the scoring rules to the grouped data to score the 
grouped data. 

12. A method of operating on data, the method comprising: 
generating grouped data by applying to the data at least one user-defined 

grouping rale for grouping the data into a user-definable number of groups; and 

scoring the grouped data by applying to the grouped data at least one user- 
defined scoring rule for scoring the grouped data according to user-defined scores. 

13. A method according to claim 1 1 or 12 wherein the data comprises a 
nimiber of parameters for each of a number of cases and the scoring rule 
comprises a scoring function of user-selectable parameters and user-defined 
weights for the selected parameters to be used in scoring the cases, wherein the 
scoring of the grouped data comprises: 

applying the function to the data to obtain a score for each case. 

14. A method according to claim 13, fiirther comprising: 
sorting the scored cases by score. 

15. A metiiod according to claim 14, wherein the scored cases are 
sorted individually. 
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16. A method according to claim 14, wherein the scored cases are 
sorted by cluster. 

17. A system for operating on data, the system comprising: 

a mechanism constructed and adapted to provide at least one user-defined 
grouping rule for grouping the data into a user-definable number of groups; and 

a mechanism constructed and adapted to apply at least one of the grouping 
mles to the data. 

18. A system as in claim 17 wherein the data are provided in a table 
and wherein the at least one grouping rule applies to at least one user-selectable 
column of the table. 

19. A system as in claim 17, wherein the at least one grouping rule 
defines breakpoints corresponding to the user-definable mmiber of groups, and 
wherein application of the at least one rule to the data divides the data into groups 
based on the breakpoints. 

20. A system as in claim 17, further comprising: 

a mechanism constmcted and adapted to present the grouped data in a 
manner that visually distinguishes the groups. 
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21 . A system as in claim 20, wherein the grouping rules associate 
colors with groups and wherein the mechanism constructed and adapted to present 
the grouped data further comprises: 

a mechanism constmcted and adapted to color an aspect of the data 
according to the rules. 

22. A system as in claim 20, wherein the data are in labeled columns in 
a spreadsheet, and wherein the at least one grouping rule specifies at least one 
breakpoint and a corresponding color for each at least one breakpoint, and wherein 
the mechanism constructed and adapted to present the grouped data comprises: 

a mechanism constructed and adapted to color each data item in the at least 
one labeled column of the data based on the at least one breakpoint and the 
corresponding color of the at least one breakpoint. 

23. A system as in any one of claims 1 9 and 22, wherein the 
breakpoints are selected from: (a) numeric values; and (b) textual values. 

24. A system as in claim 19 further comprising: 

a mechanism constructed and adapted to determine at least one breakpoint 
automatically, based on the data. 

25. A system as in claim 21 wherein the data are provided in a table, 
wherein the mechanism constmcted and adapted to color an aspect of the data 
comprises: 
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a mechanism constructed and adapted to color backgrounds of table cells 
according to the rules. 

26. A system as in claim 17 wherein the number of groups is fewer 
than a number of possible data values. 

27. A system of operating on data, the system comprising: 

a mechanism constructed and adapted to provide at least one user-defined 
grouping rule for grouping the data into a user-definable nimiber of groups; 

a mechanism constructed and adapted to apply at least one of the grouping 
rules to the data to generate grouped data; 

a mechanism constructed and adapted to provide at least one user-defined 
scoring rule for scoring the grouped data according to user-defined scores; and 

a mechanism constructed and adapted to apply at least one of the scoring 
rules to the grouped data to score the grouped data. 

28. A system of operating on data, the system comprising: 

a mechanism constructed and adapted to generate grouped data by 
applying to the data at least one user-defined grouping rule for grouping the data 
into a user-definable number of groups; and 

a mechanism constructed and adapted to score the grouped data by 
applying to the grouped data at least one user-defined scoring rule for scoring the 
grouped data according to user-defined scores. 



78 



wo 01/08039 




PCT/USOO/20401 



29. A system according to claim 27 or 28 wherein the data comprises a 
number of parameters for each of a number of cases and the scoring rule 
comprises a scoring function of user-selectable parameters and user-defined 
weights for the selected parameters to be used in scoring the cases, wherein the 
mechanism constmcted and adapted to score of the grouped data comprises: 

a mechanism constmcted and adapted to apply the function to the data to 
obtain a score for each case. 

30. A system according to claim 29, further comprising: 

a mechanism constructed and adapted to sort the scored cases by score. 

31. A system according to claim 30, wherein the scored cases are 
sorted individually. 

32. A system according to claim 30, wherein the scored cases are 
sorted by cluster. 

33 . A computer-readable memory medium encoded with program data 
representing a computer program that can cause a computer to implement a 
method of operating on data, the method comprising: 

providing at least one user-defined grouping rule for grouping the data into 
a user-definable number of groups; and 

applying at least one of the grouping rules to the data. 
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34. A medium as in claim 33 wherein the data are provided in a table 
and wherein the at least one grouping rule applies to at least one user-selectable 
column of the table. 

35. A medium as in claim 33 wherein the at least one grouping rule 
defines breakpoints corresponding to the user-definable number of groups, and 
wherein application of the at least one rule to the data divides the data into groups 
based on the breakpoints. 

36. A medium as in claim 33, wherein the method fiirther comprises: 
presenting the grouped data in a manner that visually distinguishes the 

groups. 

37. A medium as in claim 36 wherein the grouping rules associate 
colors with groups and wherein the presenting of the grouped data further 
comprises: 

coloring an aspect of the data according to the rules. 

38. A medium as in claim 36, wherein the data are in labeled columns 
in a spreadsheet, and wherein the at least one grouping mle specifies at least one 
brea]qK)int and a corresponding color for each at least one breakpoint, and wherein 
the presenting of the grouped data comprises: 

coloring each data item in the at least one labeled column of the data based 
on the at least one. breakpoint and the corresponding color of the at least one 
bieakpoint. 
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39. A medium as in any one of claims 35 and 38, wherein the 
brealq3oints are selected from: (a) numeric values; and (b) textual values. 

40. A medium as in claim 35 wherein the at least one breakpoint is 
determined automatically based on the data. 

41. A mediimi as in claim 37 wherein the data are provided in a table, 
wherein the coloring of an aspect of the data comprises: 

coloring backgrounds of table cells according to the rules. 

42. A mediimi as in claim 33 wherein the number of groups is fewer 
than a number of possible data values. 

43. A computer-readable memory medium encoded with program data 
representing a computer program that can cause a computer to implement a 
method of operating on data, the method comprising: 

providing at least one user-defined grouping rule for grouping the data into 
a user-definable number of groups; 

applying at least one of the grouping rules to the data to generate grouped 

data; 

providing at least one user-defined scoring rule for scoring the grouped 
data according to user-defined scores; and 

applying at least one of the scoring rules to the grouped data to score the 
grouped data. 
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44. A computer-readable memory mediimi encoded with program data 
representing a computer program that can cause a computer to implement a 
method of operating on data, the method comprising: 

generating grouped data by applying to the data at least one user-defined 
grouping mle for grouping the data into a user-definable nvunber of groups; and 

scoring the grouped data by applying to the grouped data at least one user- 
defined scoring rule for scoring the grouped data according to user-defined scores. 

45. A medium according to claim 43 or 44, wherein the data comprises 
a number of parameters for each of a number of cases and the scoring rule 
comprises a scoring function of user-selectable parameters and user-defined 
weights for the selected parameters to be used in scoring the cases, wherein the 
scoring of the grouped data comprises: 

applying the function to the data to obtain a score for each case. 

46. A medium according to claim 44, the method further comprising: 
sorting the scored cases by score. 

47. A medium according to claim 46, wherein the scored cases are 
sorted individually. 

48. A medium according to claim 46, wherein the scored cases are 
sorted by cluster. 
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